Local Optimality of User Choices and 
Collaborative Competitive Filtering 

Shuang Hong Yang 
College of Computing 
Georgia Institute of Technology 
Atlanta, GA 30332 

shy@gatech.edu

Abstract 

While a user's preference is directly reflected in the interactive choice
process between her and the recommender, this wealth of information has not
been fully exploited for learning recommender models. In particular, existing
collaborative filtering (CF) approaches take into account only the binary
events of user actions but disregard the contexts in which users' decisions
are made. In this paper, we propose Collaborative Competitive Filtering (CCF),
a framework for learning user preferences by modeling the choice process in
recommender systems. CCF employs a multiplicative latent factor model to
characterize the dyadic utility function. But unlike CF, CCF models the user
behavior of choices by encoding a local competition effect. In this way, CCF
allows us to leverage dyadic data that was previously lumped together with
missing data in existing CF models. We present two formulations and an
efficient large-scale optimization algorithm. Experiments on three real-world
recommendation data sets demonstrate that CCF significantly outperforms
standard CF approaches in both offline and online evaluations.

1 Introduction 



Recommender systems have become a core component of today's personalized
online businesses [23, 9]. By connecting various items (e.g., retail products,
movies, news articles, advertisements, experts) to potentially interested
users, recommender systems enable online webshops (e.g. Amazon, Netflix,
Yahoo!) to expand their marketing efforts from what was historically a few
best-sellers toward a large variety of long-tail (niche) products [28]. Such
abilities are endowed by a personalization algorithm for identifying the
preference of each individual user, which is at the heart of a recommender
system.

Predicting user preference is challenging. Usually, the user and item spaces
are very large yet the observations are extremely sparse. Learning from such
rare, noisy and largely missing evidence carries a high risk of overfitting.
Indeed, this data sparseness issue has been widely recognized as a critical
challenge for constructing effective recommender systems.

A straightforward way to build a recommender would be to learn a user's
preference based on the prior interactions between her and the recommender
system. Typically, such an interaction is an "opportunity give-and-take"
process (cf. Table 1), where at each interaction:

1) a user $u$ queries the system (e.g. visits a movie recommendation web site);

2) the system offers a set of (personalized) opportunities (i.e. items)
$O = \{i_1, \ldots, i_l\}$ (e.g. recommends a list of movies of potential
interest to the user);

3) the user chooses one item $i^* \in O$ (or more) from these offers and takes
actions accordingly (e.g. clicks a link, rents a movie, views a news article,
purchases a product).
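
The three steps above can be summarized as one session record. The following is a minimal Python sketch; the names `Session`, `action_dyads` and `context_dyads` are hypothetical and introduced only for illustration. It shows how a single session yields both the action dyad and the offered-but-not-chosen dyads.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Session:
    """One user-recommender interaction (hypothetical container)."""
    user: str
    offers: List[str]   # the offer set O shown to the user
    chosen: List[str]   # the item(s) i* the user acted on


def action_dyads(s: Session) -> List[Tuple[str, str]]:
    """Dyads (u, i*) for offered items the user acted on."""
    return [(s.user, i) for i in s.offers if i in s.chosen]


def context_dyads(s: Session) -> List[Tuple[str, str]]:
    """Dyads (u, i) for items that were offered but not chosen."""
    return [(s.user, i) for i in s.offers if i not in s.chosen]


session = Session(user="u1",
                  offers=["Die Hard", "Terminator", "Rocky"],
                  chosen=["Die Hard"])
print(action_dyads(session))
print(context_dyads(session))
```

A model that keeps only the output of `action_dyads` discards the information that Terminator and Rocky were seen and passed over.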

Somewhat surprisingly, this interaction process has not been fully exploited
for learning recommenders. Instead, research on recommender systems has focused
almost exclusively on recovering user preference by completing the matrix of
user actions $(u, i^*)$, while the actual contexts in which user decisions are
made are disregarded. In particular, Collaborative Filtering (CF) approaches
capture only the action dyads $(u, i^*)$, while the contextual dyads (i.e.
$\{(u, i)\}$ for all $i \in O$ and $i \neq i^*$) are typically treated as
missing data. For example, the rating-oriented models aim to approximate the
ratings that users assigned to items [25, 20, 24, 1, 6, 15]; the recently
proposed ranking-oriented algorithms [29, 16] attempt to recover the ordinal
ranking information derived from the ratings. Although this formulation of the
recommendation problem has led to numerous algorithms which excel at a number
of data sets, including the prize-winning work of [13], we argue here that the
formulation is inherently flawed: a preference for Die Hard given a generic set
of movies only tells us that the user appreciates action movies; however, a
preference for Die Hard over Terminator or Rocky suggests that the user might
favor Bruce Willis over other action heroes. In other words, the context of
user choice is vital when estimating user preferences.

When it comes to modeling user-recommender interactions, an important question
arises: what is the fundamental mechanism underlying user choice behavior? As
reflected by its name, collaborative filtering is based on the notion of
"collaboration effects": similar items get similar responses from similar
users. This assumption is essential because by encoding the "collaboration"
among users or among items or both, CF greatly alleviates the issue of data
sparseness and in turn makes more reliable predictions based on the pooled
evidence across different items/users.

Table 1: An example of user-recommender interactions and the derived observation
matrix: entries with value "1" denote the action dyads; dyads that are observed
without user actions (e.g. offered by the recommender but not picked by the user)
are marked with dots ("•"). CF trains only on the 1 entries while the • entries
are treated as missing data. CCF distinguishes between unseen entries and entries
marked with dots.

It has long been recognized in psychology and economics that, besides the
effect of collaboration [5], another mechanism governs users' behavior:
competition [11, 19, 3]. In particular, items tend to compete with each other
for the attention of users; therefore, axiomatically, user $u$ will pick the
best item $i^*$ (i.e. the one with the highest utility) when confronted with
the set of alternatives $O$. For example, consider a user with a penchant for
action movies by Arnold Schwarzenegger. Given the choice between Sleepless in
Seattle and Die Hard, he will likely choose the latter. However, when afforded
the choice between the oeuvres of Schwarzenegger, Diesel or Willis, he is
clearly more likely to choose Schwarzenegger over the works of Willis. To
capture a user's preference more accurately, it is therefore essential for a
recommender model to take into account such local competition effects.
Unfortunately, this effect is absent in a large number of collaborative
filtering approaches.

In this paper, we present Collaborative Competitive Filtering (CCF) for
learning recommender models by modeling users' choice behavior in their
interactions with the recommender system. Similar to matrix factorization
approaches for CF, we employ a multiplicative latent factor model to
characterize the dyadic utility function (i.e. the utility of an item to a
user). In this way, CCF encodes the collaboration effect among users and items
as CF does. But instead of learning only from the action dyads (i.e. $(u, i^*)$,
or the "1" entries in Table 1), CCF bases the factorization learning on the
whole user-recommender interaction traces. It therefore leverages not only the
action dyads $(u, i^*)$ but also the dyads in the context without user actions
(i.e. $(u, i)$ for all $i \in O$ and $i \neq i^*$, or the dot entries in
Table 1), which were treated as potentially missing data in CF approaches.

To leverage the entire interaction trace for latent factor learning, we devise
probabilistic models or optimization objectives to encode the local competition
effect underlying the user choice process. We present two formulations with
different flavors. The first formulation is derived from the multinomial logit
model that has been widely used for modeling user choice behavior (e.g. choice
of brands) in psychology [17], economics [18, 19] and marketing science [11].
The second formulation relates closely to the ordinal regression models in
content filtering [12] (e.g. web search ranking). Essentially, both
formulations attempt to encode the "local optimality of user choices", so that
every opportunity $i^*$ taken by a user $u$ is encouraged to be locally the
best in the context of the opportunities $O$ offered to her. From a machine
learning viewpoint, CCF is a hybrid of local and global learning, where a
global matrix factorization model is learned by optimizing a local
context-aware loss function. We discuss the implementation of CCF, establish
efficient learning algorithms and deliver a package that allows distributed
optimization on streaming data.

Experiments were conducted on three real-world recommendation data sets.
First, on two dyadic data sets, we show that CCF improves over standard CF
models by more than 50% in the best case in terms of offline top-k ranking.
Furthermore, on a commercial recommender system, we show that CCF
significantly outperforms CF models in both offline and online evaluations. In
particular, CCF achieves up to 7% improvement in offline top-k ranking and up
to 13% in terms of online click rate prediction.

Outline: §2 describes the problem formulation and background, and motivates
CCF. §3 presents the detailed CCF models, learning algorithms and our
distributed implementation. §4 reports experiments and results. §5 reviews
related work and §6 summarizes the results.

2 Preliminaries 

2.1 Problem formulation 

Consider the user-system interaction in a recommender system: we have users
$u \in \mathcal{U} := \{1, 2, \ldots, U\}$ and items
$i \in \mathcal{I} := \{1, 2, \ldots, I\}$; when a user $u$ visits the site,
the system recommends a set of items $O = \{i_1, \ldots, i_l\}$ and $u$ in turn
chooses a (possibly empty) subset $D \subseteq O$ from $O$ and takes actions
accordingly (e.g. buys some of the recommended products). For ease of
explanation, let us temporarily assume $D = \{i^*\}$, i.e. $D$ is not empty and
contains exactly one item $i^*$. More general scenarios shall be discussed
later.

To build the recommender system, we record a collection of historical
interactions in the form of $\{(u_t, O_t, D_t)\}$, where $t$ is the index of a
particular interaction session. Our goal is to generate recommendations
$O_{\bar t}$ for an incoming visit $\bar t$ of user $u_{\bar t}$ such that the
user's satisfaction is maximized. Hereafter, we refer to $\mathcal{U}$ as the
user space, $\mathcal{I}$ as the item space, $O_t$ as the offer set or context,
$D_t$ as the decision set, and $i^*$ as a decision.

A key component of a recommender system is a model $r(u, i)$ that characterizes
the utility of an item $i \in \mathcal{I}$ to a user $u \in \mathcal{U}$, upon
which recommendations for a new inquiry from user $u$ can be made by simply
ranking items based on $r(u, i)$ and recommending the top-ranked ones.
Collaborative filtering is by far the most well-known method for modeling such
dyadic responses.

2.2 Collaborative filtering 

In collaborative filtering we are given observations of dyadic responses
$\{(u, i, y_{ui})\}$, with each $y_{ui}$ being an observed response (e.g. a
user's rating of an item, or an indication of whether user $u$ took an action
on item $i$). The whole mapping

$$(u, i) \mapsto y_{ui} \quad \text{where } u \in \mathcal{U},\ i \in \mathcal{I}$$

constitutes a large $|\mathcal{U}| \times |\mathcal{I}|$ matrix $Y$. While we
might have millions of users and items, only a tiny proportion (considerably
less than 1% in realistic data sets) of entries are observable. Note the subtle
difference in terms of the data representation: while we record entire
sessions, CF only records the dyadic responses.

Collaborative filtering exploits the notion of "collaboration effects", i.e.,
similar users have similar preferences for similar items. By encoding
collaboration, CF pools the sparse observations in such a way that for
predicting $r(u, i)$ it also borrows observations from other users/items.
Generally speaking, existing CF methods fall into one of the following two
categories.

Neighborhood models. A popular class of approaches to CF is based on
propagating the observations of responses among items or users that are
considered as neighbors. The model first defines a similarity measure between
items / users. Then, an unseen response between user $u$ and item $i$ is
approximated based on the responses of neighboring users or items [25, 20], for
example, by simply averaging the neighboring responses with similarities as
weights.
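
The similarity-weighted averaging mentioned above can be illustrated as follows. This is a minimal sketch, not a specific published neighborhood method; the function name `predict_neighborhood` and the toy ratings and similarities are assumptions made for illustration.

```python
def predict_neighborhood(neighbor_responses, similarities):
    """Similarity-weighted average of the neighbors' responses.

    Returns None when the total similarity is zero (no usable neighbors).
    """
    total = sum(similarities)
    if total == 0:
        return None
    weighted = sum(s * y for s, y in zip(similarities, neighbor_responses))
    return weighted / total


# Ratings that three neighboring users gave to item i (made-up values),
# and the similarity of each neighbor to the target user u.
estimate = predict_neighborhood([5.0, 3.0, 4.0], [0.5, 0.25, 0.25])
print(estimate)
```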

Latent factor models. This class of methods learns predictive latent factors to
estimate the missing dyadic responses. The basic idea is to associate latent
factors¹ $\phi_u \in \mathbb{R}^k$ with each user $u$ and
$\phi_i \in \mathbb{R}^k$ with each item $i$, and to assume a multiplicative
model for the dyadic response,

$$p(y_{ui} \mid u, i) = p(y_{ui} \mid r_{ui}; \Theta),$$

where $\Theta$ denotes the set of hyper-parameters and the utility is assumed
to be a multiplicative function of the latent factors,

$$r(u, i) = \phi_u^\top \phi_i.$$

This way the factors can explain past responses and in turn make predictions
for future ones. This model implicitly encodes the Aldous-Hoover theorem [13]
for exchangeable matrices: the $y_{ui}$ are independent of each other given
$\phi_u$ and $\phi_i$. In essence, it amounts to a low-rank approximation of
the matrix $Y$ that naturally embeds both users and items into a vector space
in which the distances directly reflect the semantic relatedness.

1 Throughout this paper, we assume each latent factor $\phi$ contains a
constant component so as to absorb user/item-specific offsets into the latent
factors.

To design a concrete model [2, 22, 26], one needs to specify a distribution for
the dependence. Afterwards, the model boils down to an optimization problem.
For example, two commonly used formulations are:

- $\ell_2$ regression. The most popular learning formulation is to minimize the
$\ell_2$ loss within an empirical risk minimization framework [14, 24]:

$$\min \sum_{(u,i) \in \Omega} (y_{ui} - \phi_u^\top \phi_i)^2 + \lambda_{\mathcal{U}} \sum_{u \in \mathcal{U}} \|\phi_u\|^2 + \lambda_{\mathcal{I}} \sum_{i \in \mathcal{I}} \|\phi_i\|^2,$$

where $\Omega$ denotes the set of $(u, i)$ dyads for which the responses
$y_{ui}$ are observed, and $\lambda_{\mathcal{U}}$ and $\lambda_{\mathcal{I}}$
are regularization weights.
- Logistic. Another popular formulation [22] is to use logistic regression by
optimizing the cross-entropy

$$\min \sum_{(u,i) \in \Omega} \log\left[1 + \exp(-\phi_u^\top \phi_i)\right] + \lambda_{\mathcal{U}} \sum_{u \in \mathcal{U}} \|\phi_u\|^2 + \lambda_{\mathcal{I}} \sum_{i \in \mathcal{I}} \|\phi_i\|^2.$$
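
To make the two objectives concrete, the sketch below evaluates them on a toy example. It is only an illustration under stated assumptions: factors are stored in plain dictionaries, the logistic term is evaluated on positive dyads only, and all names (`l2_objective`, `logistic_objective`, `regularizer`) and numbers are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def regularizer(user_f, item_f, lam_u, lam_i):
    """lam_u * sum_u ||phi_u||^2 + lam_i * sum_i ||phi_i||^2."""
    return (lam_u * sum(dot(p, p) for p in user_f.values())
            + lam_i * sum(dot(p, p) for p in item_f.values()))


def l2_objective(observed, user_f, item_f, lam_u, lam_i):
    """Squared error over the observed dyads plus the regularizer."""
    loss = sum((y - dot(user_f[u], item_f[i])) ** 2
               for (u, i), y in observed.items())
    return loss + regularizer(user_f, item_f, lam_u, lam_i)


def logistic_objective(observed, user_f, item_f, lam_u, lam_i):
    """log[1 + exp(-phi_u . phi_i)] over observed dyads plus the regularizer."""
    loss = sum(math.log1p(math.exp(-dot(user_f[u], item_f[i])))
               for (u, i) in observed)
    return loss + regularizer(user_f, item_f, lam_u, lam_i)


user_f = {"u": [1.0, 0.0]}
item_f = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
observed = {("u", "a"): 1.0}          # only the action dyad is observed

l2_value = l2_objective(observed, user_f, item_f, 0.5, 0.5)
logit_value = logistic_objective(observed, user_f, item_f, 0.5, 0.5)
print(l2_value, logit_value)
```

Note that item "b" contributes to neither data term: whether it was offered and rejected or never shown makes no difference to these objectives.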

2.3 Motivating discussions 

Collaborative filtering approaches have made substantial progress and are
currently the state-of-the-art techniques for recommender systems. However, we
argue here that CF approaches are lacking in several respects. First of all,
although data sparseness is a big issue, CF does not fully leverage the wealth
of user behavior data. Taking the user-recommender interaction process
described in §2.1 as an example (cf. Table 1), CF methods typically use only
the action dyad $(u, i^*)$ of each session, while the other dyads
$\{(u, i) \mid i \in O, i \neq i^*\}$ are treated as missing and disregarded.
This wastes a valuable learning resource: as the experiments in this paper
show, these non-action dyads carry useful information.

Secondly, most existing CF approaches learn user preference collaboratively by
either approximating the dyadic responses $\{y_{ui^*}\}$ [25, 20, 24, 1, 6, 15]
or preserving the ordinal ranking information derived from the dyadic responses
[29, 16]; none of them models the user choice behavior in recommender systems.
In particular, as users choose from competing alternatives, there is naturally
a local competition effect among the items offered in a session. Our work shows
that this effect can be an important clue for learning user preference.

Latent factor models are very flexible and can be under-determined (or
over-parameterized) even for a rather moderate number of users/items. Combined
with the above two limitations, this makes CF approaches vulnerable to
over-fitting [15]. In particular, while most existing CF models might learn
consistently on user ratings (numerical values, typically with five levels) if
given enough training data, they usually perform poorly on binary responses.
For example, for the aforementioned interaction process (cf. Table 1), the
response $y_{ui}$ is typically a binary event indicating whether or not item
$i$ was accepted by the user $u$. With the non-action dyads being ignored, the
responses are exclusively positive observations (either $y_{ui} = 1$ or
missing). As a result, we obtain an overly optimistic estimator that is biased
toward positive responses and predicts positive for almost all the incoming
dyads (see §4.1 for empirical evidence).

3 Collaborative competitive filtering

We present a novel framework for recommender learning by modeling the
system-user interaction process. The key insight is that the contexts $O_t$ in
which a user's decisions are made should be taken into account when learning
recommender models. In practice, a user $u$ could make different decisions when
facing different contexts $O_t$. For instance, an item $i$ would not have been
chosen by $u$ if it had not been presented to her in the first place; likewise,
user $u$ could choose another item if the context $O_t$ changes such that a
better offer (e.g., a more interesting item) is presented to her.

In this section, we describe the framework of collaborative competitive
filtering. We start with some axiomatic views of user choice behavior.
Following that, we present the learning formulation of CCF. We then develop the
optimization algorithms and implementation techniques. We close the section
with a discussion of useful extensions.

3.1 Local optimality of user choices

Formally, the individual choice process (i.e. user-recommender interactions) in
a recommender system can be viewed as an instance of the opportunity
give-and-take (GAT) process.

DEFINITION [GAT]: An opportunity give-and-take process is a process of
interactions among an agent $u$, a system $S$ and a set of opportunities
$\mathcal{I}$; at an interaction $t$:

- $u$ is given a set of opportunities $O_t \subset \mathcal{I}$ by $S$;

- $u$ makes a decision by taking one of the opportunities: $i_t^* \in O_t$;

- each opportunity $i \in O_t$ could potentially give $u$ a revenue (utility)
of $r_{ui}$ if taken, or 0 otherwise.

Note that we assume the agent is a priori not aware of all the items and can
get to know them only through the recommender $S$; therefore items that are not
in $O_t$ are inaccessible to $u$ at interaction $t$. This is reasonable
considering that the number of items is usually very large. Moreover, we assume
an agent $u$ is a rational decision maker: she knows that her choice of item
$i$ will be at the expense of the others $i' \in O_t$, and therefore she
compares the alternatives before making her choice. In other words, for each
decision, $u$ considers both revenue and opportunity cost, and decides which
opportunity to take based on the potential profit of each opportunity in $O_t$.
Specifically, the opportunity cost $c_{ui}$ is the potential loss of $u$ from
taking an opportunity $i$ that excludes her from taking other opportunities:
$c_{ui} = \max\{r_{ui'} : i' \in O_t \setminus \{i\}\}$; the profit
$\pi_{ui} = r_{ui} - c_{ui}$ is the net gain of a decision. Drawing on rational
decision theory [11], we present the following principle of individual choice
behavior.

PROPOSITION: A rational decision is a decision maximizing the profit:
$i^* = \arg\max_{i \in O} \pi_{ui}$.

This proposition implies the constraint of "local optimality of user choice", a
local competitive effect restricting that the agent $u$ always chooses the
offer that is locally optimal in the context of the offer set $O_t$.
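
As a small illustration of the proposition, the sketch below computes opportunity cost and profit for a toy offer set and picks the profit-maximizing item. The utilities are made-up numbers and the function names are hypothetical; the offer set is assumed to contain at least two items.

```python
def opportunity_cost(utilities, i):
    """c_ui: the best utility among the other offered items."""
    return max(r for j, r in utilities.items() if j != i)


def profit(utilities, i):
    """pi_ui = r_ui - c_ui."""
    return utilities[i] - opportunity_cost(utilities, i)


def rational_choice(utilities):
    """The offered item with maximal profit."""
    return max(utilities, key=lambda i: profit(utilities, i))


# Utilities r_ui of one user for the items in one offer set (made-up values).
utilities = {"Die Hard": 0.9, "Terminator": 0.7, "Rocky": 0.4}
choice = rational_choice(utilities)
print(choice, profit(utilities, choice))
```

When utilities are distinct, only the chosen item has positive profit, and maximizing profit selects the same item as maximizing utility within the offer set, which is the local-optimality constraint.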

3.2 Collaborative competitive filtering 

The local-optimality principle induces a constraint which can be translated
into an objective function for recommender learning:

$$\forall i^* \in D_t:\quad r_{ui^*} \ge \max\{r_{ui} \mid i \in O_t \setminus D_t\}$$
$$\text{or}\quad P(i^* \text{ is taken}) = P\big(r_{ui^*} \ge \max\{r_{ui} \mid i \in O_t \setminus D_t\}\big). \tag{1}$$

This objective is, however, problematic. First, the inequality constraint
restricts the utility function only up to an arbitrary order-preserving
transformation (e.g. a monotonically increasing function), and hence cannot
yield a unique solution (e.g. a point estimate) [18]. Second, optimization
based on the induced objective is computationally intractable due to the max
operator. To this end, we present two surrogate objectives, both of which are
computationally efficient and show close connections to existing models.

3.2.1 Softmax model 

Our first formulation is based on random utility theory [17, 18], which has
been extensively used for modeling choice behavior in economics [19] and
marketing science [11]. In particular, we assume the utility function consists
of two components, $r_{ui} + e_{ui}$, where: (1) $r_{ui}$ is a deterministic
function characterizing the intrinsic interest of user $u$ in item $i$, for
which we use the latent factor model $r_{ui} = \phi_u^\top \phi_i$; (2) the
second part $e_{ui}$ is a stochastic error term reflecting the uncertainty and
complexity of the choice process². Furthermore, we assume the error term
$e_{ui}$ is an independently and identically distributed Weibull (extreme
point) variable:

$$\Pr(e_{ui} \le \epsilon) = e^{-e^{-\epsilon}}.$$

Together with the local-optimality principle, these two assumptions yield the
following multinomial logit model [19, 18, 11]:

$$p(i^* = i \mid u, O) = \frac{e^{r_{ui}}}{\sum_{j \in O} e^{r_{uj}}} = \frac{\exp(\phi_u^\top \phi_i)}{\sum_{j \in O} \exp(\phi_u^\top \phi_j)} \quad \text{for all } i \in O. \tag{2}$$

Intuitively, this model enforces the local-optimality constraint by using the
softmax function as a surrogate of max.

Given a collection of training interactions $\{(u_t, O_t, i_t^*)\}$, the latent
factors can be estimated using penalized maximum likelihood via

$$\min \sum_t \Big\{ \log\Big[\sum_{i \in O_t} \exp(\phi_{u_t}^\top \phi_i)\Big] - \phi_{u_t}^\top \phi_{i_t^*} \Big\} + \lambda_{\mathcal{U}} \sum_{u \in \mathcal{U}} \|\phi_u\|^2 + \lambda_{\mathcal{I}} \sum_{i \in \mathcal{I}} \|\phi_i\|^2. \tag{3}$$

While the above formulation is a convex optimization problem w.r.t. $r_{ui}$,
as each log-likelihood term in Eq. (3) is concave in $r_{ui}$ (so its negative
is convex), it is nonconvex w.r.t. the latent factors $\phi$. We postpone the
discussion of optimization algorithms to §3.3.
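
The softmax choice probability and the per-session data term of the penalized likelihood can be sketched as follows. This is an illustrative implementation under stated assumptions (factors as plain lists, a toy offer set, hypothetical function names), not the authors' code; regularization is omitted.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def choice_probabilities(phi_u, offered):
    """Softmax of the utilities phi_u . phi_i over the offer set only."""
    scores = {i: dot(phi_u, phi_i) for i, phi_i in offered.items()}
    m = max(scores.values())            # subtract the max for numerical stability
    exps = {i: math.exp(s - m) for i, s in scores.items()}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def session_nll(phi_u, offered, chosen):
    """log sum_i exp(r_ui) - r_ui* for one session (no regularizer)."""
    return -math.log(choice_probabilities(phi_u, offered)[chosen])


phi_u = [1.0, 0.0]
offered = {"a": [2.0, 0.0], "b": [0.0, 1.0], "c": [0.0, -1.0]}  # toy factors
probs = choice_probabilities(phi_u, offered)
nll = session_nll(phi_u, offered, "a")
print(probs, nll)
```

The normalization runs over the offered items only, so items that were never shown to the user in a session do not compete with the chosen one.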

2 The error term essentially accounts for all the subtle, uncertain and unmeasurable factors that
influence user choice behaviors, for example, a user's mood, past experience, or other factors (e.g.,
whether the decision is made in a hurry, together with her friends, or totally unconsciously).

3.2.2 Hinge model

Our second formulation is based on a simple reduction of the local-optimality
constraint. Note that, from Eq. (1), it follows that:

$$P(i = i^* \mid u, O) = P\big((r_{ui^*} - r_{ui}) > (e_{ui} - e_{ui^*}),\ \forall i \in O\big) \le P\big((r_{ui^*} - \bar{r}_u) > (\bar{e}_u - e_{ui^*})\big),$$

where $\bar{r}_u = \frac{1}{|O| - 1} \sum_{i \in O \setminus \{i^*\}} r_{ui}$ is
the average potential utility that $u$ could possibly gain from the non-chosen
items, and $\bar{e}_u$ is the corresponding average of the error terms.
Intuitively, the above model encourages the utility difference between the
choice and the non-chosen items, $r_{ui^*} - \bar{r}_u$, to be nontrivially
greater than random errors. Based on this notion, we present the following
formulation, which views the task as a pairwise preference learning problem
[12] and uses the average of the non-choices as the negative preference:

$$\min \sum_t \xi_t + \lambda_{\mathcal{U}} \sum_{u \in \mathcal{U}} \|\phi_u\|^2 + \lambda_{\mathcal{I}} \sum_{i \in \mathcal{I}} \|\phi_i\|^2 \tag{4}$$
$$\text{s.t.}\quad r_{u_t i_t^*} - \frac{1}{|O_t| - 1} \sum_{i \in O_t \setminus \{i_t^*\}} r_{u_t i} \ge 1 - \xi_t \quad \text{and} \quad \xi_t \ge 0.$$

This formulation is directly related to the maximum score estimation [18] of
the multinomial logit model Eq. (2). Intuitively, it directly reflects the
insight that user decisions are usually made by comparing alternatives and
considering the difference of potential utilities. In other words, it learns
latent factors by maximizing the marginal utility between the user's choice and
the average of the non-choices.

Again, the optimization is convex w.r.t. $r_{ui}$ but nonconvex w.r.t. the
latent factors; therefore standard optimization tools such as the large variety
of RankSVM [12] solvers are not directly applicable.
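
For a single session, the slack variable of the hinge formulation has a closed form: it is the amount by which the margin between the chosen utility and the average non-chosen utility falls short of 1. The sketch below computes it on toy factors; the function name `hinge_slack` and all values are hypothetical, and the offer set is assumed to contain at least two items.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hinge_slack(phi_u, offered, chosen):
    """Smallest xi_t >= 0 satisfying the margin constraint for one session."""
    r = {i: dot(phi_u, phi_i) for i, phi_i in offered.items()}
    others = [v for i, v in r.items() if i != chosen]
    margin = r[chosen] - sum(others) / len(others)   # r_ui* minus the average
    return max(0.0, 1.0 - margin)


phi_u = [1.0, 0.0]
offered = {"a": [2.0, 0.0], "b": [0.0, 1.0], "c": [0.0, -1.0]}  # toy factors

slack_good = hinge_slack(phi_u, offered, "a")   # chosen item has the top utility
slack_bad = hinge_slack(phi_u, offered, "b")    # chosen item is below average
print(slack_good, slack_bad)
```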

3.2.3 Complexity 

It is worth noting that our CCF formulations have an appealing linear
complexity, $O(|\mathcal{I}| \times |O|)$, where the offer size $|O|$ is
typically a very small number. For example, Netflix recommends $|O| = 7$ movies
for each visit, and the Yahoo! front page highlights $|O| = 4$ hot news items
for each visitor. Therefore, CCF has the same order of complexity as the
rating-oriented CF models. Note that the ranking-oriented CF approaches
[29, 16] are much more expensive: for each user $u$, the learning complexity is
quadratic, $O(|\mathcal{I}|^2)$, as they learn the preference of each user by
comparing every pair of items.

3.3 Learning algorithms



As we have already mentioned, due to the use of bilinear terms, both CCF
variants are nonconvex optimization problems regardless of the choice of loss
function. While there are convex reformulations for some settings, they tend to
be computationally inefficient for large-scale problems as they occur in
industry: the convex formulations require the manipulation of a full matrix,
which is impractical for anything beyond thousands of users.

Moreover, the interactions between users and items change over time and it is
desirable to have algorithms which process this information incrementally. This
calls for learning algorithms that are efficient and preferably capable of
updating dynamically so as to reflect incoming data streams, which excludes
offline learning algorithms such as classical SVD-based factorization
algorithms [13] or spectral eigenvalue decomposition methods [16] that involve
large-scale matrices.

We use a distributed stochastic gradient variant with averaging based on the
Hadoop MapReduce framework. The basic idea is to decompose the objectives in
(3) or (4) by running stochastic optimization on sub-blocks of the interaction
traces in parallel in the Map phase, and to combine the results for $\phi_i$ in
the Reduce phase. The basic structure is analogous to [6, 32].

Stochastic Optimization. We derive a stochastic gradient descent algorithm to
solve the optimization described in Eq. (3) or Eq. (4). The algorithm is
computationally efficient and decouples across interactions and users, and is
therefore amenable to parallel implementation.

The algorithm loops over all the observations and updates the parameters by 
moving in the direction defined by negative gradient. Specifically, we can carry 
out the following update equations on each machine separately: 



3 We carry out an annealing procedure to discount $\eta$ by a factor of 0.9 after each iteration, as
suggested by [14].




(5) 



(6) 
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where H(A is the Heaviside function, i.e. H(x) = 1 if x > and H(x) = 
otherwisen 
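
As an illustration of what one stochastic gradient step can look like, the sketch below differentiates the per-session softmax term of Eq. (3), $\log \sum_{j \in O_t} \exp(\phi_u^\top \phi_j) - \phi_u^\top \phi_{i^*}$, with respect to the user factor and the offered item factors. It is derived here from that objective and is not a reproduction of the paper's update equations; the function names, the learning rate `eta`, the way the regularizer is applied only to the touched factors, and the toy data are all assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_probs(phi_u, item_f, offers):
    scores = [dot(phi_u, item_f[i]) for i in offers]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {i: e / z for i, e in zip(offers, exps)}


def session_nll(user_f, item_f, u, offers, chosen):
    return -math.log(softmax_probs(user_f[u], item_f, offers)[chosen])


def sgd_step(user_f, item_f, u, offers, chosen, eta=0.1, lam=0.0):
    """One gradient step on a single session (softmax variant)."""
    phi_u = user_f[u][:]                      # old user factor, used throughout
    p = softmax_probs(phi_u, item_f, offers)
    k = len(phi_u)
    # d/d phi_u = sum_j p_j phi_j - phi_i*  (+ regularizer gradient)
    grad_u = [sum(p[i] * item_f[i][d] for i in offers)
              - item_f[chosen][d] + 2.0 * lam * phi_u[d] for d in range(k)]
    # d/d phi_i = (p_i - 1[i == i*]) phi_u  (+ regularizer gradient)
    for i in offers:
        coeff = p[i] - (1.0 if i == chosen else 0.0)
        item_f[i] = [item_f[i][d] - eta * (coeff * phi_u[d]
                                            + 2.0 * lam * item_f[i][d])
                     for d in range(k)]
    user_f[u] = [phi_u[d] - eta * grad_u[d] for d in range(k)]


user_f = {"u": [0.1, -0.2]}
item_f = {"a": [0.3, 0.1], "b": [-0.1, 0.2], "c": [0.0, -0.3]}
offers = ["a", "b", "c"]

before = session_nll(user_f, item_f, "u", offers, "a")
sgd_step(user_f, item_f, "u", offers, "a", eta=0.1)
after = session_nll(user_f, item_f, "u", offers, "a")
print(before, after)
```

An annealing schedule such as the one in footnote 3 would multiply `eta` by 0.9 after each pass over the data.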



Feature Hashing. A key challenge in learning CCF models on large-scale data is
that the storage of parameters as well as observable features requires a large
amount of memory and a reverse index to map user IDs to memory locations. In
particular, in recommender systems with hundreds of millions of users the
memory requirement would easily exceed what is available on today's computers
(100 million users with 100 latent feature dimensions each amounts to 40GB of
RAM). We address this problem by implementing feature hashing [30] on the space
of matrix elements. In particular, by allowing random collisions and applying a
hash mapping to the latent factors (i.e. $\phi$), we keep the entire
representation in memory, thus greatly accelerating optimization.
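
The idea of hashing the space of matrix elements can be sketched as follows: every (entity, dimension) pair is mapped by a hash function to a slot in one fixed-size table, so no reverse index is needed and colliding pairs share a parameter. This is a simplified illustration, not the scheme of [30] in detail; the class `HashedFactors`, the use of CRC32 as the hash, and the table size are assumptions. The 40GB figure in the text is consistent with 4-byte floats, which is also an assumption here.

```python
import zlib


class HashedFactors:
    """Fixed-size parameter table addressed by hashing (entity, dimension)."""

    def __init__(self, num_slots, dim):
        self.num_slots = num_slots
        self.dim = dim
        self.table = [0.0] * num_slots

    def _slot(self, entity_id, d):
        key = f"{entity_id}:{d}".encode("utf-8")
        return zlib.crc32(key) % self.num_slots   # deterministic across runs

    def get(self, entity_id):
        """Latent factor of an entity, read through the hash."""
        return [self.table[self._slot(entity_id, d)] for d in range(self.dim)]

    def add(self, entity_id, delta):
        """Apply an additive update; colliding entries share the update."""
        for d, g in enumerate(delta):
            self.table[self._slot(entity_id, d)] += g


factors = HashedFactors(num_slots=1024, dim=4)
factors.add("user:42", [0.1, 0.2, 0.3, 0.4])
print(factors.get("user:42"))

# Memory of an unhashed table: 100 million users x 100 dimensions x 4 bytes.
bytes_needed = 100_000_000 * 100 * 4
print(bytes_needed / 1e9, "GB")
```

With hashing, memory is bounded by `num_slots` regardless of how many users appear in the stream, at the price of occasional collisions.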

3.4 Extensions 

We now discuss two extensions of CCF to address the fact that in some cases users 
choose not to respond to an offer at all and that moreover we may have observed 
features in addition to the latent representation discussed so far. 

Sessions without response 

In establishing the CCF framework for modeling user choice behavior data, we
assumed that for each user-system interaction $t$, the decision set $D_t$
contains at least one item. This assumption is, however, not true in practice.
A user's visit to a recommender system does not always yield an action. For
example, users frequently visit e-commerce websites without making any
purchase, or browse a news portal without clicking on an ad. Actually, such
nonresponded visits may account for a vast majority of the traffic that a
recommender system receives. Moreover, different users may have different
propensities for taking an action. Here, we extend the multinomial logit model
to model both responded and nonresponded interactions, $(u_t, O_t, i_t^*)$ and
$(u_t, O_t, \emptyset)$ respectively.

This is accomplished by adding a scalar $\theta_u$ for each user $u$ to capture
the action threshold of user $u$. We assume that, at an interaction $t$, user
$u_t$ takes an effective action only if she feels that the overall quality of
the offers $O_t$ is good enough to be worth her attention. In keeping with the
logistic model, this means that

$$p(i^* = i \mid u, O) = \frac{\exp(\phi_u^\top \phi_i)}{\exp(\theta_u) + \sum_{j \in O} \exp(\phi_u^\top \phi_j)} \tag{7}$$

for all $i \in O$, and the probability of no response is given by the
remainder, that is, by
$\frac{\exp(\theta_u)}{\exp(\theta_u) + \sum_{j \in O} \exp(\phi_u^\top \phi_j)}$.
In essence, this amounts to a model where the 'non-response' has a certain
reserve utility that needs to be exceeded for a user to respond. We may extend
the hinge model in the same spirit (we use a trade-off constant $C > 0$ to
calibrate the importance of the non-responses):

$$\min \sum_t \xi_t + C \sum_t \epsilon_t + \lambda_{\mathcal{U}} \sum_{u \in \mathcal{U}} \|\phi_u\|^2 + \lambda_{\mathcal{I}} \sum_{i \in \mathcal{I}} \|\phi_i\|^2$$
$$\text{subject to}\quad r_{u_t i_t^*} - \bar{r}_{u_t} - \theta_u \ge 1 - \xi_t \quad \text{for all } i_t^* \in D_t \text{ if } D_t \neq \emptyset,$$
$$\theta_u - r_{u_t i} \ge 1 - \epsilon_t, \quad \forall i \in O_t \text{ if } D_t = \emptyset,$$
$$r_{ui} = \phi_u^\top \phi_i \quad \text{and} \quad \xi_t \ge 0 \quad \text{and} \quad \epsilon_t \ge 0. \tag{8}$$

4 In our implementation, we approximate this by a continuous function, which
helps with convergence.
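
The effect of the per-user threshold can be illustrated with a short sketch of the extended logit model: the threshold enters the normalizer as one extra "no response" alternative. The function name `response_probabilities` and the toy factors and thresholds are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def response_probabilities(phi_u, theta_u, offered):
    """Choice probabilities over the offer set plus a 'no response' outcome."""
    scores = {i: dot(phi_u, phi_i) for i, phi_i in offered.items()}
    m = max(max(scores.values()), theta_u)        # for numerical stability
    exps = {i: math.exp(s - m) for i, s in scores.items()}
    none_weight = math.exp(theta_u - m)
    z = none_weight + sum(exps.values())
    return {i: e / z for i, e in exps.items()}, none_weight / z


phi_u = [1.0, 0.0]
offered = {"a": [2.0, 0.0], "b": [0.0, 1.0]}      # toy factors

probs, p_none = response_probabilities(phi_u, 0.0, offered)
probs_hi, p_none_hi = response_probabilities(phi_u, 3.0, offered)
print(p_none, p_none_hi)
```

Raising the threshold from 0 to 3 makes non-response the most likely outcome for the same offer set, which is how the model separates picky users from eager ones.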



Table 2: Statistics of data sets.

Data set        #user   #item   #dyads   offer size
Social          1.2M    400     29M      -
Netflix-5star   0.48M   18K     100M     -
News            3.6M    2.5K    110M     4




Content features 

In previous sections, we used a plain latent factor model for quantifying
utility, i.e. $r_{ui} = \phi_u^\top \phi_i$. A known drawback of such a model
is that it only captures dyadic data (responses), and therefore generalizes
poorly to completely new entities, i.e. unseen users or items, for which
observations are missing at the training stage. Here, we extend the model by
incorporating content features. In particular, we assume that, in addition to
the latent features $\phi$, there exist some observable properties
$x_u \in \mathbb{R}^m$ (e.g. a user's self-crafted registration files) for each
user $u$, and $x_i \in \mathbb{R}^n$ (e.g. a textual description of an item)
for each item $i$. We then model the utility $r_{ui}$ as a function of both
types of features (i.e. observable and latent):

$$r_{ui} \sim p(r_{ui} \mid \phi_u^\top \phi_i + x_u^\top M x_i, \Theta),$$

where the matrix $M \in \mathbb{R}^{m \times n}$ provides a bilinear form for
characterizing the utility based on the content features of the corresponding
dyads. This model integrates both collaborative filtering [15] and content
filtering [7]. On the one hand, if the user $u$ or item $i$ has no or merely
non-informative observable features, the model degrades to a
factorization-style utility model [24]. On the other hand, if we assume that
$\phi_u$ and $\phi_i$ are irrelevant, for instance, if $u$ or $i$ is totally
new to the system such that there is no interaction involving either of them,
as in a cold-start setting, this model becomes the classical content-based
relevance model commonly used in, e.g., webpage ranking, advertisement
targeting [6], and content recommendation.
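
The combined utility is the sum of the latent term and the bilinear content term. The sketch below evaluates this mean utility on toy values; the function name `utility`, the dimensions and all numbers are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def utility(phi_u, phi_i, x_u, x_i, M):
    """phi_u . phi_i + x_u^T M x_i, with M given as a list of rows (m x n)."""
    latent = dot(phi_u, phi_i)
    content = sum(x_u[a] * M[a][b] * x_i[b]
                  for a in range(len(x_u)) for b in range(len(x_i)))
    return latent + content


phi_u, phi_i = [1.0, 2.0], [0.5, 0.25]            # latent factors (k = 2)
x_u, x_i = [1.0, 0.0], [0.0, 1.0, 1.0]            # content features (m = 2, n = 3)
M = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]

r_full = utility(phi_u, phi_i, x_u, x_i, M)
# Cold start: a new item has no learned latent factor, so only content remains.
r_cold = utility(phi_u, [0.0, 0.0], x_u, x_i, M)
print(r_full, r_cold)
```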

4 Experiments 

We report experimental results on two test-beds. First, we compare the CCF
models with CF baselines on two dyadic data sets with simulated choice
contexts. We use simulated contexts generated from CF data sets because we are
unaware of any publicly available data sets directly suitable for CCF.
Furthermore, we extend our evaluation to a stricter setting based on
user-system interaction session data from a commercial recommender system.

4.1 Dyadic response data 

We use dyadic data with binary responses, i.e. $\{(u, i, y_{ui})\}$ where
$y_{ui} \in \{1, \text{missing}\}$. We compare different recommender models in
terms of their top-$k$ ranking performance.

Social network data. The first data set was collected from a commercial social
network site, where a user expresses her preference for an item with an explicit
indication of "like". We examine data collected over about one year, involving
hundreds of millions of users and a large collection of applications, such as
games, sports, news feeds, finance, entertainment, travel, shopping, and local
information services. Our evaluation focuses on a random subset consisting of
about 400 items, 1.2 million users and 29 million dyadic responses ("like"
indications).

Netflix 5 star data. For the sake of reproducibility, we also report results on
a data set derived from the Netflix prize data⁵, one of the most famous public
data sets for recommendation. The Netflix data set contains 480K users and 18K
movies. We derive binary responses by treating only 5-star ratings as
"positive" dyads and all other entries as missing.

5 http://www.netflixprize.com

Figure 1: Histograms of the predicted dyadic responses $\{y_{ui}\}$: while the
predictions by CF (top) are over-optimistically concentrated on positive
responses (i.e. predicting "relevant" for all possible dyads), the results
obtained by CCF (bottom) demonstrate a more realistic power-law property.

Table 3: Top-$k$ ranking performance on two binary dyadic data sets with
simulated contexts.

Data set        Model              AP@5     AR@5     nDCG@5
Social          CF  $\ell_2$       0.448    0.230    0.475
                CF  Logistic       0.449    0.230    0.476
                CCF Softmax        0.688    0.261    0.704
                CCF Hinge          0.686    0.260    0.702
Netflix-5star   CF  $\ell_2$       0.135    0.022    0.145
                CF  Logistic       0.135    0.023    0.146
                CCF Softmax        0.186    0.033    0.189
                CCF Hinge          0.185    0.032    0.188

For both data sets, we randomly split the data into three pieces, one for training, 
one for testing and the other for validation. 

Evaluation metrics. We assess the recommendation performance of each model by
comparing the top suggestions of the model to the true actions taken by a user
(i.e. "like" or 5-star). We consider three measures commonly used for assessing
top-$k$ ranking performance in the IR community:

AP is the average precision. AP@n averages the precision of the top-n ranked
list of each query (e.g. user).

AR or average recall is the average recall of the top-n ranked list of each
query.

nDCG or normalized Discounted Cumulative Gain is the normalized
position-discounted precision score. It gives larger credit to top positions.

For all three metrics we use n = 5, since most social networks and movie
recommendation sites recommend a similar number of items for each user visit.
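The three measures can be sketched as follows. This is an illustrative reading of the definitions above rather than the exact evaluation code: it assumes binary relevance, a base-2 logarithmic position discount for nDCG, and averaging over users who have at least one relevant test item. All function names are hypothetical.

```python
import math
from typing import Callable, Dict, Hashable, List, Set


def precision_at_n(ranked: List[int], relevant: Set[int], n: int) -> float:
    """Fraction of the top-n ranked items that are relevant."""
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / n


def recall_at_n(ranked: List[int], relevant: Set[int], n: int) -> float:
    """Fraction of the relevant items that appear in the top-n list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / len(relevant)


def ndcg_at_n(ranked: List[int], relevant: Set[int], n: int) -> float:
    """Binary-relevance nDCG with a 1 / log2(position + 1) discount."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:n]) if item in relevant)
    ideal_hits = min(len(relevant), n)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def mean_over_users(metric: Callable[[List[int], Set[int], int], float],
                    rankings: Dict[Hashable, List[int]],
                    truth: Dict[Hashable, Set[int]],
                    n: int) -> float:
    """Average a per-user metric over users with a non-empty ground truth."""
    users = [u for u in rankings if truth.get(u)]
    if not users:
        raise ValueError("no user has ground-truth items")
    return sum(metric(rankings[u], truth[u], n) for u in users) / len(users)
```

For instance, `mean_over_users(precision_at_n, rankings, truth, 5)` would give AP@5 under these assumptions, where `rankings` maps each user to her ranked item list and `truth` maps each user to the set of items she liked in the test set.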

Evaluation protocol. We compare the two CCF models (i.e. Softmax and Hinge)
with the two standard CF factorization models (i.e. $\ell_2$ and Logistic)
described in §2.2. For dyadic data with binary responses, the Logistic CF model
amounts to the state-of-the-art [22, 1].

We adopt a fairly strict top-$k$ ranking evaluation. For each user, we assess
the top results out of a total preference ordering of the whole item set. In
particular, for each user $u$, we consider all the items as candidates; we
compute the three measures by comparing the ground truth (the set of items in
the test set that user $u$ actually liked) with the top-5 suggestions predicted
by each model. For statistical consistency, we employ a cross-validation style
procedure. We learn the models on training data with parameters tuned on
validation data, and then apply the trained models to the test data to assess
the performance. All three measures reported are computed on test data only, and
they are averaged over five random repeats (i.e. random splits of the data).⁶

To render the data compatible with CCF, we simulate a fixed-size pseudo-offer
set for each interaction. Specifically, for every positive observation, e.g.
$y_{ui} = 1$, we randomly sample a handful of missing (unobserved) entries
$\{y_{ui'}\}_{i'=1:m}$. These sampled dyads are then treated as non-choices, and
together with the positive dyad, they are used as the offer set for the current
session. In our experiments, we choose $m = 9$ pseudo non-choices; in other
words, we assume the offer size $|O_t| = 10$.
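The simulation step can be sketched as below. This is a hedged illustration under stated assumptions, not the procedure actually used for the reported results: non-choices are drawn uniformly without replacement from the items the user has no positive response for, and the function name `simulate_offer_sets`, the `seed` argument and the session tuple layout are introduced here for illustration only.

```python
import random
from typing import Dict, List, Set, Tuple

# (user, offer set of size m + 1, chosen item)
Session = Tuple[int, List[int], int]


def simulate_offer_sets(positives: Dict[int, Set[int]],
                        all_items: List[int],
                        m: int = 9,
                        seed: int = 0) -> List[Session]:
    """Build one pseudo-offer set per positive dyad (u, i)."""
    rng = random.Random(seed)
    sessions: List[Session] = []
    for user in sorted(positives):
        liked = positives[user]
        # Candidate non-choices: items with a missing response for this user.
        unobserved = [item for item in all_items if item not in liked]
        if len(unobserved) < m:
            raise ValueError("not enough unobserved items to sample from")
        for chosen in sorted(liked):
            non_choices = rng.sample(unobserved, m)
            offer = non_choices + [chosen]
            rng.shuffle(offer)  # avoid encoding the choice by position
            sessions.append((user, offer, chosen))
    return sessions


# Toy usage: two users, 20 items, m = 9 gives offer sets of size 10.
toy_positives = {0: {1, 2}, 1: {3}}
toy_sessions = simulate_offer_sets(toy_positives, list(range(20)), m=9, seed=42)
```

Each resulting session contains exactly one chosen item and $m$ sampled non-choices, so the offer size is $m + 1$.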

Results and analysis. We report the mean scores in Table 3. Since the data sets
are fairly large, the standard deviations of all values were below 0.001, so we
omit them from the results. As can be seen from the table, CCF dramatically
outperforms the CF baselines on both data sets. In terms of AP@5, the two CCF
models gain about 52.8%-53.6% over the two CF models on the Social data, and
37.0%-37.8% on Netflix-5star. Similar comparisons apply to the nDCG@5 measure.
In terms of AR@5, the CCF models outperform their CF competitors by up to 13.5%
on Social, and 30% on Netflix-5star. All these improvements are statistically
highly significant. Note that these results are quite consistent: both CF models
perform comparably with each other on both data sets; the performance of the two
CCF variants is also comparable; between the two groups, there are noticeable
gaps.
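The AP@5 improvement ranges quoted above are relative gains over the CF scores. As a small worked example, the sketch below recomputes them from the AP@5 column of Table 3; the helper names `relative_gain` and `gain_range` are hypothetical, and the pairing rule (weakest CCF against strongest CF for the lower end, strongest CCF against weakest CF for the upper end) is our reading of how such a range would be formed.

```python
from typing import Dict, Tuple


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# AP@5 values copied from Table 3: (first model, second model) per family.
ap5: Dict[str, Dict[str, Tuple[float, float]]] = {
    "Social": {"CF": (0.448, 0.449), "CCF": (0.688, 0.686)},
    "Netflix-5star": {"CF": (0.135, 0.135), "CCF": (0.186, 0.185)},
}


def gain_range(scores: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Smallest and largest relative gain of CCF over CF, rounded to 0.1%."""
    low = relative_gain(min(scores["CCF"]), max(scores["CF"]))
    high = relative_gain(max(scores["CCF"]), min(scores["CF"]))
    return round(low, 1), round(high, 1)


social_range = gain_range(ap5["Social"])          # (52.8, 53.6)
netflix_range = gain_range(ap5["Netflix-5star"])  # (37.0, 37.8)
```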

One argument we made in this paper for motivating our work is that, since the CF
models disregard the context information and learn only on positive (action)
dyads, they almost inevitably yield overly optimistic predictions (i.e.
predicting positive for all possible dyads). We hypothesize that such estimation
bias is one of the key reasons for the inability of CF models to learn from
binary dyadic data. As an empirical validation, in Figure 1 we plot the
histograms of the predicted dyadic responses $y_{ui}$ (i.e. entries of the
diffused matrices) obtained by a CF model ($\ell_2$) and a CCF model
respectively.⁷ As we can see, the CF model indeed predicts "positive" for most
(if not all) dyads; in contrast, the results obtained by the CCF model
demonstrate a more realistic power-law distribution [8].⁸

6 Note that the contextual information (the offer set $O_t$ for each interaction
$t$) is missing for both of the dyadic data sets. We choose these data sets
anyway to ensure that the results (at least on the Netflix data set) can be
repeated by other research groups. Results on interaction data are reported in
§4.2.

7 Similar results are obtained with other losses.

8 Note that the distribution starts at around 0.5 instead of 0, which is
consistent with our intuitions

In reality, each user can only afford to "like" a few items out of a huge number
of alternatives. This power-law property is crucial for information filtering
because we intend to identify a few truly relevant items by filtering out a
great many irrelevant ones. A power-law recommender is desirable in a way
analogous to a narrow-bandwidth filter, which effectively filters out the noise
(i.e. irrelevant items) and lets only the true signal (i.e. relevant items) pass
to the end node (i.e. users).

4.2 User-system interaction data 

We now move on to a more realistic evaluation by applying CCF to real user-system 
interaction data. We evaluate CCF in both an offline test and an online test while 
comparing its results to both CF baselines. 

Data. We collected a large-scale set of user-system interaction traces from a
commercial article (news feed) recommender system. In each interaction, the
system offers four personalized articles to the visiting user, and the user
chooses one of them by clicking to read that article. The recommendations change
dynamically over time, even during a user's visit. The system regularly logs
every click event of every user visit. It also records the articles being
presented to users at a series of discrete time points. To obtain a context set
for each user-system interaction, we therefore trace back to the closest
recording time point right before the user click, and we use the articles
presented at that time point as the offer set for the current session. We
collected such interaction traces from logged records of over one month. We use
a random subset containing 3.6 million users, 2500 items and over 110 million
interaction traces. Learning an effective recommender on this data set is
particularly challenging because the article pool is refreshed dynamically and
each article has a lifetime of only several hours: it appears only once within a
particular day and is then pulled from the pool, never to appear again.
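The trace-back step that recovers an offer set from the presentation log can be sketched with a binary search over the recording times. The log layout below (a time-sorted list of `(timestamp, articles)` snapshots) and the function name `offer_set_for_click` are assumptions made for illustration; the actual logging format of the system is not described here. The sketch interprets "right before" strictly, so a snapshot recorded at exactly the click time is not used.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence, Tuple

# (recording time, articles presented at that time)
Snapshot = Tuple[float, List[str]]


def offer_set_for_click(snapshots: Sequence[Snapshot],
                        click_time: float) -> Optional[List[str]]:
    """Return the articles of the latest snapshot recorded before the click.

    `snapshots` must be sorted by recording time. Returns None when no
    snapshot precedes the click.
    """
    times = [t for t, _ in snapshots]
    # bisect_left gives the first index with time >= click_time, so the
    # element just before it is the latest snapshot strictly before the click.
    idx = bisect_left(times, click_time) - 1
    if idx < 0:
        return None
    return snapshots[idx][1]


# Toy log with three recording time points (hypothetical values).
toy_log: List[Snapshot] = [
    (10.0, ["a1", "a2", "a3", "a4"]),
    (20.0, ["a2", "a5", "a6", "a7"]),
    (30.0, ["a5", "a8", "a9", "a10"]),
]
offer = offer_set_for_click(toy_log, click_time=25.0)
```

For many clicks against the same log, the `times` list would be built once and reused rather than recomputed per call.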

Evaluation protocol. We consider the following two evaluation settings, one
offline and the other online.

Offline evaluation Similar to the evaluations presented in §4.1, we evaluate the
learned recommender models in terms of the top-$k$ ranking performance on
a hold-out test subset. We follow the same configurations as in §4.1 and use the
three ranking measures, i.e. AP@n, AR@n and nDCG@n, as the evaluation

that there is actually no truly "irrelevant" item for a user: any item has
potential utility for a user; users choose one over another based on relative
preference rather than absolute utility. This is true especially in this era of
information explosion, where a user typically faces so many alternatives that
she can only pick the one she likes most while ignoring the others.

Table 4: Offline test (top-$k$ ranking performance) on user-system interaction
data.

Training data   Model              AP@4     AR@4     nDCG@4
30%             CF  $\ell_2$       0.245    0.261    0.255
                CF  Logistic       0.246    0.263    0.257
                CCF Softmax        0.262    0.278    0.274
                CCF Hinge          0.261    0.278    0.273
50%             CF  $\ell_2$       0.250    0.273    0.268
                CF  Logistic       0.252    0.276    0.269
                CCF Softmax        0.266    0.285    0.278
                CCF Hinge          0.265    0.285    0.277
70%             CF  $\ell_2$       0.253    0.275    0.271
                CF  Logistic       0.253    0.276    0.274
                CCF Softmax        0.267    0.287    0.280
                CCF Hinge          0.267    0.286    0.280

metrics. Note that here we use n = 4 instead of 5, because it is the default 
recommendation size used in the news recommender system. 
Online evaluation We further conduct an online test. In particular, for each in- 
coming interaction, we use the trained models to predict which item among 
the four recommendations will be taken by the user. This prediction is of cru- 
cial importance because one of the key objectives for a recommender system 
is to maximize the traffic and monetary revenue by lifting the click-through 
rate. 

Offline test results. In this setting, we train each model on progressively
larger proportions (30%, 50% and 70%) of randomly sampled training data, and
evaluate each trained model in terms of offline top-$k$ ranking performance. The
results are reported in Table 4. The two CCF models greatly outperform the two
CF baselines in all three evaluation metrics. Specifically, the CCF models gain
up to 6.9% over the two CF models in terms of average precision, up to 6.5% in
terms of average recall, and up to 7.5% in terms of nDCG. We also conducted a
t-test at the standard 0.05 significance level. The hypothesis tests indicate
that all the improvements obtained by CCF are significant.

It is worth noting that the improvements obtained by CCF over the CF baselines
are especially evident when the training data are sparser (e.g. using only 30%
of the training data). This observation empirically validates our argument that
the contexts contain substantial useful information for learning recommender
models, especially when the dyadic action responses are scarce.

Table 5: Online test (predicted click probability) on user-system interaction
data.

Model           30% train   50% train   70% train
Random          0.250
CF  $\ell_2$    0.337       0.343       0.347
CF  Logistic    0.341       0.345       0.347
CCF Softmax     0.376       0.384       0.391
CCF Hinge       0.377       0.385       0.391


The offline results obtained by CCF are quite satisfactory. For example, the
average precision is up to 0.276, which means that, out of the four recommended
items, on average 1.1 are truly "relevant" (i.e. actually clicked by the user).
This performance is quite promising, especially considering that most of the
articles in the content pool are transient and subject to dynamic updates.

Online test results. We further evaluate the online performance of each compared
model by assessing the predicted click rates. Click rate is essential for an
online recommender system because it is closely related to both the traffic and
the revenue of a webshop. In our evaluation, for each incoming visit
$(u_t, O_t, i_t^*)$, we use the trained models to predict the user's choice,
i.e. we ask the question: "among all the offered items $i \in O_t$, which one
will most likely be clicked?" We use the trained model to rank the items in the
offer set, and compare the top-ranked item with the item that was actually taken
(i.e. $i_t^*$) by user $u_t$. We evaluate the results in terms of the prediction
accuracy.
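The online measure amounts to a top-1 accuracy over sessions, which can be sketched as follows. The scoring function `utility` stands in for any trained model (CF or CCF) and, like the name `choice_prediction_accuracy` and the tie-breaking rule (first item in the offer list wins), is an assumption made for this illustration.

```python
from typing import Callable, Iterable, List, Tuple

# (user u_t, offer set O_t, clicked item i_t*)
Session = Tuple[int, List[int], int]


def choice_prediction_accuracy(sessions: Iterable[Session],
                               utility: Callable[[int, int], float]) -> float:
    """Fraction of sessions whose top-scored offered item was the clicked one."""
    session_list = list(sessions)
    if not session_list:
        raise ValueError("at least one session is required")
    correct = 0
    for user, offer, clicked in session_list:
        # Rank the offer set by predicted utility and take the best item.
        predicted = max(offer, key=lambda item: utility(user, item))
        if predicted == clicked:
            correct += 1
    return correct / len(session_list)


def toy_utility(user: int, item: int) -> float:
    """Toy stand-in for a trained model: prefers items whose id is near the user's."""
    return -abs(user - item)


toy_sessions: List[Session] = [
    (1, [0, 1, 2, 3], 1),   # predicted 1, clicked 1
    (2, [0, 3, 6, 7], 6),   # predicted 3, clicked 6
    (5, [4, 9, 7, 8], 4),   # predicted 4, clicked 4
]
accuracy = choice_prediction_accuracy(toy_sessions, toy_utility)
```

With offer sets of size four, a predictor that picks uniformly at random is correct a quarter of the time in expectation, which is the reference point for this measure.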

The results are given in Table 5. Because the size of each offer set in the
current data set is 4, a random predictor yields 0.25. As seen from the table,
while all four models obtain significantly better predictions than the random
predictor, the two CCF models further greatly outperform the two CF models.
Specifically, we observe 11.3%-12.7% improvements obtained by the CCF models
over the two CF competitors. These results are quite significant, especially
considering the dynamic property of the system.

Impact of parameters. The performance of the two CCF models is affected by the
parameter settings of the latent dimensionality, $k$, as well as the
regularization weights, $\lambda_I$ and $\lambda_U$. In Figure 2 we illustrate
how the offline top-$k$ ranking performance changes as a function of these
parameters, where we use the same value for both $\lambda_I$ and $\lambda_U$.⁹
Here we report only the results with the nDCG@5 measure because the results show
similar shapes when other measures (including the click rate) are used. As can
be seen from the figure, the nDCG curves typically have an inverted U-shape,
with the optimal values achieved in the middle. In particular, for both CCF
models, a dimensionality around 10 and a regularization weight around 0.0001
yield the best performance; this is also the default parameter setting we used
in obtaining our reported results.

9 Due to heavy computational costs, these results are obtained on a relatively
small subset of data.

Figure 2: Offline top-$k$ ranking performance (nDCG@5) as a function of latent
dimensionality (top) and regularization weight (bottom).

Nonresponded sessions. In Section 3.4 we presented two models for encoding
nonresponded interactions, e.g. a user visits the News website but does not
click any of the recommended articles. These approaches are promising because,
compared to the responded sessions, the nonresponded ones are typically much
more plentiful, and if learned successfully, this wealth of information has the
potential to alleviate the critical data-sparsity issue in recommendation.

Unfortunately, due to the data-logging mechanism of the News recommender system,
we were unable to obtain such nonresponded interactions (this is left to future
work). Instead, as a preliminary test, we conducted an evaluation on a small set
of pseudo nonresponded sessions derived from the responded ones. In particular,
we hold out a randomly sampled subset of sessions; for each of these sessions,
we hide the item clicked by the user, and use the remaining items as a
nonresponded context set by assuming no click for this set. We add this set of
derived nonresponded sessions to the training set, and train the model on the
combined training data. The results from this preliminary evaluation did not
show significant performance improvement. This is likely because the surrogate
distribution is invalid. A detailed analysis with more realistic data is the
subject of future research.
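The derivation of pseudo nonresponded sessions can be sketched as below. This is only one possible reading of the procedure: it assumes the held-out sessions are removed from the responded pool once their clicked item is hidden, and the sampling `fraction`, the `seed` argument and the function name `derive_nonresponded` are hypothetical.

```python
import random
from typing import List, Tuple

Responded = Tuple[int, List[int], int]   # (user, offer set, clicked item)
Nonresponded = Tuple[int, List[int]]     # (user, offer set with no click)


def derive_nonresponded(sessions: List[Responded],
                        fraction: float,
                        seed: int = 0
                        ) -> Tuple[List[Responded], List[Nonresponded]]:
    """Hold out a random subset of sessions and hide their clicked items."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_held = round(fraction * len(sessions))
    held = set(rng.sample(range(len(sessions)), n_held))
    responded: List[Responded] = []
    nonresponded: List[Nonresponded] = []
    for idx, (user, offer, clicked) in enumerate(sessions):
        if idx in held:
            # Drop the clicked item; the rest is treated as a no-click context.
            remaining = [item for item in offer if item != clicked]
            nonresponded.append((user, remaining))
        else:
            responded.append((user, offer, clicked))
    return responded, nonresponded


# Toy usage: ten sessions of four offered items each (hypothetical ids).
toy = [(u, [4 * u, 4 * u + 1, 4 * u + 2, 4 * u + 3], 4 * u + 1)
       for u in range(10)]
kept, pseudo = derive_nonresponded(toy, fraction=0.3, seed=7)
```

Because the remaining items were shown next to an item the user actually preferred, treating them as a genuine no-click session changes the underlying distribution, which is consistent with the caveat about the surrogate distribution above.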

5 Related Work 

Although a natural reflection of a user's preference is the process of
interaction with the recommender, to our knowledge, this interaction data has
not been exploited for learning recommender models. Instead, research on
recommender systems has focused almost exclusively on learning from the dyadic
data. In particular, collaborative filtering approaches capture only the
user-item dyadic data with explicit user actions, while the context dyads are
typically treated as missing values. For example, the rating-oriented models aim
to approximate the ratings that users assigned to items [23, 20, 24, 1, 6, 15],
whereas the recently proposed ranking-oriented algorithms [29, 16] attempt to
recover the ordinal ranking information derived from the ratings.
By exploiting past records of user-item dyadic responses for future prediction,
based on either neighborhood-based [25, 20, ?] or latent-factor-based methods
[24, 1, 6, 15, 29], collaborative filtering approaches encode the collaboration
effect that similar users have similar preferences on similar items. In this
paper, by leveraging the user-recommender interaction data, we show that much
better recommender performance can be obtained when a local-competition effect
underlying the user choice behaviors is also encoded.

The multinomial logit model we present is derived from random utility theory
[?, 18]. The model is well established and has long been widely used in, e.g.
psychology [?], economics [?] and marketing science [?]. In particular, [?]
applied the model to examine the brand choice of households on grocery data;
[10] showed this model is theoretically and empirically superior to the $\ell_2$
regression model. More recently, the pioneering work of [13] first applied the
model to characterize online choices in recommender systems and investigated how
recommender systems impact sales diversity. Following these steps, this work
further employs the model to learn factorization models for recommendation.

The Hinge formulation of CCF is closely connected to the pairwise preference
learning approaches widely used in Web search ranking [12]. Our model, however,
differs from these content filtering models [12] in that, instead of learning a
feature mapping as in [12], it uses the formulation for learning a
multiplicative latent factor model.

6 Summary and future research 

We presented a framework for learning recommenders by modeling user choice
behavior in the user-system interaction process. Instead of modeling only the
sparse binary events of user actions as in traditional collaborative filtering,
the proposed collaborative-competitive filtering models take into account the
contexts in which user decisions are made. We presented two models in this
spirit, established efficient learning algorithms and demonstrated the
effectiveness of the proposed approaches with extensive experiments on three
large-scale real-world recommendation data sets.

There are several promising directions for future research. 

Attention budget and position bias. When deriving the CCF model, we assume that
a user decides whether to take an offer solely based on the comparison of
utilities. This assumption, however, neglects a factor which might be important
in practice. In particular, a user might have budgeted attention such that when
making choices he pays attention to only a few top-ranked items and totally
disregards the others. This position bias is evident in both web search ranking
and recommendation. We plan to take this into consideration when building choice
models.

Recommender strategy and user behavior. A key feature of the current paper 
is that we assume the recommender adopts a deterministic strategic policy when 
making recommendations. In practice, a recommender could also adaptively react 
to the users' actions as well as its own considerations (e.g. inventory constraints, 
promotion requirement of certain brands). We would like to extend our analysis 
here to model the interactive process between users and recommender. 

Further empirical validation. Due to data collection constraints, some parts of
the proposed models are not strictly evaluated in the current paper. We plan to
refine the data collection mechanism and conduct further experiments.